The wine industry has seen a recent growth spurt as social drinking is on the rise. A key factor in wine certification and quality assessment is physicochemical testing: laboratory-based tests that take into account factors like acidity, pH, residual sugar and other chemical properties. It would be interesting if we could predict the type of a wine given some of its properties. This could later be scaled up to predict the price of each individual wine, which is something every wine seller dreams of.
To make these predictions we'll use a basic neural network model. The simplest Python library out there for this is Keras; it's a really easy library to get started with when learning about deep learning.
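If Keras and its backend aren't installed yet, a quick check like the one below avoids surprises later. This is just a setup sketch, assuming a standard pip environment; your version number will differ.
# Install once from the command line: pip install tensorflow keras
# Then confirm the import works before going any further
import keras
print(keras.__version__)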
Understanding The Data
Before we start loading in the data, it might be a good idea to check how much we really know about wine (in relation to the dataset, of course).
The data consists of two datasets that are related to red and white variants of the Portuguese “Vinho Verde” wine.
In [82]:
# Load the data from the UCI Machine Learning Repository
# Import pandas
import pandas as pd
# Read in white wine data
white = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=';')
# Read in red wine data
red = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')
1) Fixed acidity: acids are major wine properties and contribute greatly to the wine’s taste. Usually, the total acidity is divided into two groups: the volatile acids and the nonvolatile or fixed acids. Among the fixed acids that you can find in wines are the following: tartaric, malic, citric, and succinic.
2) Volatile acidity: the volatile acidity is basically the process of wine turning into vinegar. In the U.S., the legal limits of volatile acidity are 1.2 g/L for red table wine and 1.1 g/L for white table wine.
3) Citric acid is one of the fixed acids that you’ll find in wines. It’s expressed in g/dm3 in the two data sets.
4) Residual sugar typically refers to the sugar remaining after fermentation stops, or is stopped. It’s expressed in g/dm3 in the red and white data.
5) Chlorides can be a major contributor to saltiness in wine. Here, it’s expressed in g(sodium chloride)/dm3.
6) Free sulfur dioxide: the part of the sulfur dioxide added to a wine that binds to other compounds is said to be bound, while the remaining, active part is said to be free. Winemakers will always try to get the highest possible proportion of free sulfur dioxide. This variable is expressed in mg/dm3 in the data.
7) Total sulfur dioxide is the sum of the bound and the free sulfur dioxide (SO2). Here, it’s expressed in mg/dm3. There are legal limits for sulfur levels in wines: in the EU, red wines can only have 160 mg/L, while white and rosé wines can have about 210 mg/L. Sweet wines are allowed to have 400 mg/L. For the US, the legal limit is set at 350 mg/L, and for Australia it is 250 mg/L. (A quick check of the data against the EU limits follows right after this list.)
8) Density is generally used as a measure of the conversion of sugar to alcohol. Here, it’s expressed in g/cm3.
9) Sulphates are to wine as gluten is to food. You might already know sulphites from the headaches that they can cause. They are a regular part of winemaking around the world and are considered necessary. In this case, they are expressed in g(potassium sulphate)/dm3.
10) Alcohol: wine is an alcoholic beverage, and as you know, the percentage of alcohol can vary from wine to wine. It shouldn’t surprise you that this variable is included in the data sets, where it’s expressed in % vol.
11) Quality: wine experts graded the wine quality between 0 (very bad) and 10 (very excellent). The eventual number is the median of at least three evaluations made by those same wine experts.
12) pH, or the potential of hydrogen, is a numeric scale to specify the acidity or basicity of the wine. As you might know, solutions with a pH less than 7 are acidic, while solutions with a pH greater than 7 are basic. With a pH of 7, pure water is neutral. Most wines have a pH between 2.9 and 3.9 and are therefore acidic.
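As a quick aside on the legal limits mentioned in point 7, we can count how many wines would exceed the EU ceilings for total sulfur dioxide. This is only an illustrative sketch; it assumes the red and white frames loaded above and that mg/dm3 is equivalent to mg/L.
# Wines above the EU total sulfur dioxide limits (160 mg/L red, 210 mg/L white)
print((red['total sulfur dioxide'] > 160).sum())
print((white['total sulfur dioxide'] > 210).sum())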
Let's start off by getting a quick view of each dataset.
In [83]:
# Print info on white wine
white.info()
In [84]:
red.info()
Our red wine dataframe has considerably fewer observations (1,599) than the white wine dataframe (4,898). All the values are floats except for the quality variable, which holds the integer ratings (0-10) given by wine experts.
In [85]:
# First five rows of the red wine data
red.head()
Out[85]:
In [86]:
# Print a random sample of 5 observations from the white wine dataset
white.sample(5)
Out[86]:
In [87]:
# Double check for null values in `red`
pd.isnull(red).sum()
Out[87]:
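For symmetry, the same null check can be run on the white wine frame; both UCI datasets ship without missing values, so all the sums should come out as zero.
# Double check for null values in `white`
pd.isnull(white).sum()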
In [88]:
import matplotlib.pyplot as plt
# Plot the two alcohol distributions side by side (y-axes not shared)
fig, ax = plt.subplots(1, 2,figsize=(10, 8))
ax[0].hist(red.alcohol, 15, facecolor='red', ec="black", lw=0.5, alpha=0.5)
ax[1].hist(white.alcohol, 15, facecolor='white', ec="black", lw=0.5, alpha=0.5)
fig.subplots_adjust(left=0, right=1, bottom=0, top=0.8, hspace=0.05, wspace=0.2)
ax[0].set_title("Red Wine")
ax[1].set_title("White Wine")
#ax[0].set_ylim([0, 800])
ax[0].set_xlabel("Alcohol (% Vol)")
ax[0].set_ylabel("Frequency")
ax[1].set_xlabel("Alcohol (% Vol)")
ax[1].set_ylabel("Frequency")
#ax[1].set_ylim([0, 800])
fig.suptitle("Distribution of Alcohol in % Vol")
plt.show()
One notices that most wines have around 9-10% alcohol. Of course, 10-12% is also frequent, though not as much as 9%. Moreover, the y-axis scales are different because the numbers of observations in the red and white datasets are unbalanced.
Next, one thing that interests me is the relation between the sulphates and the quality of the wine. As you may know, sulphates can cause people to have headaches, and I'm wondering if this influences the quality of the wine. What's more, I often hear that women especially don't want to drink wine precisely because it causes headaches. Maybe this affects the ratings for the red wine?
In [89]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].scatter(red['quality'], red["sulphates"], color="red", label="Red wine")
ax[1].scatter(white['quality'], white['sulphates'], color="white", edgecolors="black", lw=0.5, label="White wine")
ax[0].set_xlabel("Quality")
ax[1].set_xlabel("Quality")
ax[0].set_ylabel("Sulphate")
ax[1].set_ylabel("Sulphate")
ax[0].set_xlim([0,10])
ax[1].set_xlim([0,10])
ax[0].set_ylim([0,2.5])
ax[1].set_ylim([0,2.5])
fig.subplots_adjust(wspace=0.5)
ax[0].legend(loc='best')
ax[1].legend(loc='best')
fig.suptitle("Wine Quality v/s Sulphate")
plt.show()
So, from the graphs above we can see that the ratings stay fairly constant across most sulphate values, so sulphates don't necessarily affect the quality as I guessed earlier. We can also see that red wines have a higher amount of sulphates in them, which may help explain why red wine is more often associated with headaches and why some drinkers prefer white over red.
Apart from the sulphates, acidity is one of the important wine characteristics that is necessary to achieve quality wines. Great wines often balance out acidity, tannin, alcohol and sweetness. Some more research taught me that in quantities of 0.2 to 0.4 g/L, volatile acidity doesn’t affect a wine’s quality. At higher levels, however, volatile acidity can give wine a sharp, vinegary tactile sensation. Extreme volatile acidity signifies a seriously flawed wine.
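To get a rough feel for how much of the data sits above that 0.4 g/L mark, a quick check along these lines works (a sketch only, assuming g/dm3 in the data is equivalent to g/L):
# Share of wines with volatile acidity above ~0.4 g/L
print((red['volatile acidity'] > 0.4).mean())
print((white['volatile acidity'] > 0.4).mean())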
In [90]:
import numpy as np
np.random.seed(570)
redlabels = np.unique(red['quality'])
whitelabels = np.unique(white['quality'])
fig, ax = plt.subplots(1, 2, figsize=(10, 8))
redcolors = np.random.rand(6,4)
whitecolors = np.append(redcolors, np.random.rand(1,4), axis=0)
for i in range(len(redcolors)):
    redy = red['alcohol'][red.quality == redlabels[i]]
    redx = red['volatile acidity'][red.quality == redlabels[i]]
    ax[0].scatter(redx, redy, c=redcolors[i])
for i in range(len(whitecolors)):
    whitey = white['alcohol'][white.quality == whitelabels[i]]
    whitex = white['volatile acidity'][white.quality == whitelabels[i]]
    ax[1].scatter(whitex, whitey, c=whitecolors[i])
ax[0].set_title("Red Wine")
ax[1].set_title("White Wine")
ax[0].set_xlim([0,1.5])
ax[1].set_xlim([0,1.5])
ax[0].set_ylim([6,15.5])
ax[1].set_ylim([6,15.5])
ax[0].set_xlabel("Volatile Acidity")
ax[0].set_ylabel("Alcohol")
ax[1].set_xlabel("Volatile Acidity")
ax[1].set_ylabel("Alcohol")
ax[0].legend(redlabels, loc='best', bbox_to_anchor=(1.3, 1))
ax[1].legend(whitelabels, loc='best', bbox_to_anchor=(1.3, 1))
fig.suptitle("Alcohol - Volatile Acidity")
fig.subplots_adjust(top=.85, wspace=0.7)
plt.show()
In [91]:
import seaborn as sns
corr = pd.concat([red, white], ignore_index=True).corr()
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            cmap="autumn", linewidths=.2)
plt.show()
As you would expect, there are some variables that correlate, such as density and residual sugar. Volatile acidity and wine type are also more closely connected than you might have guessed by looking at the two datasets separately, and it was to be expected that free sulfur dioxide and total sulfur dioxide would correlate.
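To put numbers on those observations, the individual coefficients can be pulled straight out of the corr frame computed above (a small sketch; the exact values depend on the combined data):
# Look up a few of the pairwise correlations mentioned above
print(corr.loc['density', 'residual sugar'])
print(corr.loc['free sulfur dioxide', 'total sulfur dioxide'])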
Create a column to distinguish between red and white by giving red the value 1 and white the value 0. Why not simply label them 'red' and 'white'? Because neural networks only work with numerical data, not text labels; the network outputs probabilities which we later have to map back to one of the labels (it's not that difficult!).
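To make that last point concrete: the network's sigmoid output is a probability between 0 and 1 for each wine, which we turn back into a 0/1 type label with a simple threshold. A minimal sketch of the idea, using made-up probabilities:
import numpy as np
# Hypothetical sigmoid outputs for three wines
probs = np.array([0.93, 0.08, 0.51])
# Threshold at 0.5 to recover the 0/1 type labels
labels = (probs > 0.5).astype(int)
print(labels)  # [1 0 1]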
In [92]:
# Add `type` column to `red` with value 1
red['type'] = 1
# Add `type` column to `white` with value 0
white['type'] = 0
# Row bind white to red
wines = pd.concat([red, white], ignore_index=True)
wines.tail()
Out[92]:
In [93]:
wines.shape
Out[93]:
In [94]:
from sklearn.model_selection import train_test_split
# Specify the data (the first 11 physicochemical columns)
X = wines.iloc[:, 0:11]
# Specify the target labels
y = wines['type']
# Split the data up in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Standardization is a way to deal with values that lie far apart. The main reason we standardize is that neural networks behave better with small, comparable numbers: training becomes easier, and the optimizer is less likely to get stuck in a poor local optimum.
In [95]:
# Import `StandardScaler` from `sklearn.preprocessing`
from sklearn.preprocessing import StandardScaler
# Define the scaler
scaler = StandardScaler().fit(X_train)
# Scale the train set
X_train = scaler.transform(X_train)
# Scale the test set
X_test = scaler.transform(X_test)
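As a sanity check, the scaled training columns should now have roughly zero mean and unit standard deviation. Note that the scaler was fit on the training set only, so the test set will be close to, but not exactly, standardized.
# Means should be ~0 and standard deviations ~1 after scaling
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))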
Now that we have our data preprocessed, we can move on to the real work: building our own neural network to classify wines.
A quick way to get started is to use the Keras Sequential model: it’s a linear stack of layers. You can easily create the model by passing a list of layer instances to the constructor, which you set up by running: model = Sequential()
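As an aside, the same architecture built step by step in the next cell could also be written by passing the layer list directly to the constructor; the two styles are equivalent, and this is just a sketch of the alternative:
# Equivalent one-shot construction of the model defined below
from keras.models import Sequential
from keras.layers import Dense
model = Sequential([
    Dense(12, activation='relu', input_shape=(11,)),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')
])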
In [96]:
# Import `Sequential` from `keras.models`
from keras.models import Sequential
# Import `Dense` from `keras.layers`
from keras.layers import Dense
# Initialize the constructor
model = Sequential()
# Add an input layer
model.add(Dense(12, activation='relu', input_shape=(11,)))
# Add one hidden layer
model.add(Dense(8, activation='relu'))
# Add an output layer
model.add(Dense(1, activation='sigmoid'))
In [97]:
# Model output shape
model.output_shape
# Model summary
model.summary()
# Model config
model.get_config()
# List all weight tensors
model.get_weights()
Out[97]:
In [98]:
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=1, verbose=1)
Out[98]:
In [99]:
y_pred = model.predict(X_test)
# round predictions
y_pred = [round(x[0]) for x in y_pred]
In [100]:
y_pred[:5]
Out[100]:
In [101]:
y_test[:5]
Out[101]:
In [102]:
score = model.evaluate(X_test, y_test, verbose=1)
print(score)
# Evaluate the model: report the loss, then accuracy as a percentage
scores = model.evaluate(X_test, y_test)
print("\n%s: %.4f" % (model.metrics_names[0], scores[0]))
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
In [103]:
# Import the modules from `sklearn.metrics`
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, cohen_kappa_score
# Confusion matrix
confusion_matrix(y_test, y_pred)
Out[103]:
In [104]:
# Precision
precision_score(y_test, y_pred)
Out[104]:
In [105]:
# Recall
recall_score(y_test, y_pred)
Out[105]:
In [106]:
# F1 score
f1_score(y_test,y_pred)
Out[106]:
In [107]:
# Cohen's kappa
cohen_kappa_score(y_test, y_pred)
Out[107]: